Pricing

from $3.50 / 1,000 results

Wayback Machine URL Extractor - Archived URLs

Extract every archived URL of any domain from the Internet Archive's Wayback Machine (CDX API). Recover lost or old pages, build redirect maps and run OSINT, with date and status filters. No API key, export to CSV or JSON.

Developer

Logiover

Wayback Machine URL Extractor 🕰️ — Archived URLs from the Internet Archive

Recover every historical URL a website has ever published — straight from the Internet Archive's Wayback Machine. This Wayback Machine scraper queries the public CDX API to extract archived URLs and historical URLs for any domain — including pages that were deleted, renamed, or lost in a migration. Feed in one domain and get back up to tens of thousands of unique URLs, each with its capture date, archived HTTP status, MIME type, and a direct Wayback snapshot link.

Point it at one domain and it pulls the full historical URL inventory automatically. No API key and no login are needed, rate-limited requests are retried for you, and the output is one row per archived URL.

Looking to recover old URLs after a site migration, build a redirect map, find old/deleted pages, do OSINT on a domain's history, or pull a list of Internet Archive URLs without writing CDX queries by hand? This is the Internet Archive URL extractor that does it at scale.

✨ Key features

🕰️ Full historical URL inventory — pulls every unique URL the Wayback Machine has on record for a domain, going back to 1996.
🔑 No API key required — uses the open Internet Archive CDX API; no auth, no token, no login.
🌐 Subdomain & path matching — capture the host plus all subdomains and paths, or narrow down to a single host or path prefix.
📅 Date-range filtering — restrict to snapshots captured between two dates (fromDate / toDate).
✅ Status-code filtering — keep only 200 OK captures and drop dead/redirected ones.
🔗 Direct snapshot links — every row includes a ready-to-open web.archive.org/web/... URL.
🌊 Streamed pagination — pages through massive result sets with the CDX resumeKey mechanism, so memory stays flat even on 100k+ URL domains.
🔢 Result caps — set maxResults per domain, or 0 for unlimited.
📋 Multiple domains per run — process a whole list in one go.
📤 Export-ready — JSON, CSV, and Excel output via the Apify Dataset or REST API.

💡 Use cases

SEO migration & redirect maps — recover lost/old URLs after a site move and rebuild a complete 301 redirect map so you don't lose link equity.
Content recovery — find and restore blog posts, product pages, or docs that were deleted but still live in the archive.
OSINT & research — enumerate a target domain's historical footprint, old endpoints, removed pages, and forgotten subdomains.
Link reclamation — find old URLs that still earn backlinks so you can redirect them and reclaim the link value.
Finding old endpoints — surface admin paths, legacy APIs, and orphaned pages that no longer appear on the live site.
Competitive & web-archaeology research — reconstruct how a competitor's URL structure and content changed across years of snapshots.
Datasets — build a domain's URL/MIME/capture-history dataset for analysis.

📦 What you get

One row per unique archived URL, including:

Field	Description
`domain`	The normalized domain this URL belongs to
`url`	The original archived URL
`timestamp`	Raw 14-digit Wayback capture timestamp (`YYYYMMDDhhmmss`)
`capturedAt`	ISO 8601 form of the capture timestamp
`statusCode`	HTTP status the archive recorded for that capture (e.g. `200`, `301`, `404`, or `-`)
`mimeType`	Content type recorded at capture time (e.g. `text/html`)
`digest`	Wayback content digest (used internally for de-duplication)
`snapshotUrl`	Direct link to the archived snapshot on `web.archive.org`

Example output

{
  "domain": "nasa.gov",
  "url": "http://www.nasa.gov/mission_pages/station/main/index.html",
  "timestamp": "20120114043915",
  "capturedAt": "2012-01-14T04:39:15.000Z",
  "statusCode": "200",
  "mimeType": "text/html",
  "digest": "AB23CD45EF67GH89IJ01KL23MN45OP67",
  "snapshotUrl": "https://web.archive.org/web/20120114043915/http://www.nasa.gov/mission_pages/station/main/index.html"
}
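
The relationship between `timestamp`, `capturedAt`, and `snapshotUrl` in the example above can be reproduced in a few lines of Python. This is a minimal sketch, not the actor's own code: the helper names `to_captured_at` and `to_snapshot_url` are hypothetical, and it assumes the 14-digit timestamp is in UTC, as the `Z` suffix in the example suggests.

```python
from datetime import datetime, timezone


def to_captured_at(timestamp: str) -> str:
    """Convert a 14-digit Wayback timestamp (YYYYMMDDhhmmss) to ISO 8601."""
    # Assumption: the raw timestamp is UTC.
    parsed = datetime.strptime(timestamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    # Match the millisecond-precision "Z" form shown in the example output.
    return parsed.strftime("%Y-%m-%dT%H:%M:%S.000Z")


def to_snapshot_url(timestamp: str, original_url: str) -> str:
    """Build the direct snapshot link from the timestamp and the original URL."""
    return f"https://web.archive.org/web/{timestamp}/{original_url}"


row = {
    "timestamp": "20120114043915",
    "url": "http://www.nasa.gov/mission_pages/station/main/index.html",
}
print(to_captured_at(row["timestamp"]))  # 2012-01-14T04:39:15.000Z
print(to_snapshot_url(row["timestamp"], row["url"]))
```

The same helpers are handy if you only keep `timestamp` and `url` in a trimmed export and want to rebuild the other two fields later.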

🚀 How to use it

1. Click Try for free / Start.
2. Add one or more domains to Domains (e.g. nasa.gov, bbc.com). URLs and www. are normalized automatically.
3. (Optional) Pick a matchType, set a date range, filter by status code, or raise maxResults (0 = unlimited).
4. Click Save & Start.
5. Export the archived URL list as JSON, CSV, Excel or via API, and open any row's snapshotUrl to view the archived page.

⚙️ Input

Field	Type	Description	Default
`domains`	array	Required. One or more domains or URLs (e.g. `nasa.gov`, `bbc.com`). Wildcards added automatically	–
`matchType`	enum	`subdomains` (host + all subdomains + paths), `host` (exact host only), `domain` (host + subdomains), `prefix` (path prefix)	`subdomains`
`fromDate`	string	Optional `YYYYMMDD` lower bound on capture date	–
`toDate`	string	Optional `YYYYMMDD` upper bound on capture date	–
`filterStatus`	string	Optional — only return captures with this HTTP status (e.g. `200`)	–
`maxResults`	integer	Max unique URLs per domain. `0` = unlimited	`5000`
`proxyConfiguration`	object	Proxy settings. Defaults to Apify Proxy	Apify Proxy

Example input

{
  "domains": ["nasa.gov"],
  "matchType": "subdomains",
  "fromDate": "20100101",
  "toDate": "20201231",
  "filterStatus": "200",
  "maxResults": 5000,
  "proxyConfiguration": { "useApifyProxy": true }
}
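
If you generate input objects in a script, a quick local check can catch malformed dates before a run starts. The sketch below is illustrative only: `validate_input` is a hypothetical helper, and the actor's own validation rules may differ from the ones assumed here (8-digit `YYYYMMDD` dates, `fromDate` not later than `toDate`, non-negative `maxResults`).

```python
from datetime import datetime


def validate_input(run_input: dict) -> list[str]:
    """Return a list of problems found in an input shaped like the example."""
    problems = []
    if not run_input.get("domains"):
        problems.append("domains must contain at least one entry")

    parsed = {}
    for key in ("fromDate", "toDate"):
        value = run_input.get(key)
        if value is None:
            continue  # both date bounds are optional
        try:
            # strptime alone accepts short forms like "2010011", so check length first.
            if len(value) != 8 or not value.isdigit():
                raise ValueError(value)
            parsed[key] = datetime.strptime(value, "%Y%m%d")
        except (TypeError, ValueError, AttributeError):
            problems.append(f"{key} must be a valid YYYYMMDD date")

    if len(parsed) == 2 and parsed["fromDate"] > parsed["toDate"]:
        problems.append("fromDate must not be later than toDate")

    max_results = run_input.get("maxResults", 5000)
    if not isinstance(max_results, int) or max_results < 0:
        problems.append("maxResults must be a non-negative integer (0 = unlimited)")
    return problems


example = {
    "domains": ["nasa.gov"],
    "matchType": "subdomains",
    "fromDate": "20100101",
    "toDate": "20201231",
    "filterStatus": "200",
    "maxResults": 5000,
}
print(validate_input(example))  # []
```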

🔍 How it works

1. Each domain you provide is normalized: scheme, www., paths and wildcards are stripped down to a bare host.
2. A CDX API query is built from your matchType, date range, and status filter. It requests the original, timestamp, statuscode, mimetype and digest fields and sets collapse=urlkey, so each URL is returned once rather than once per capture.
3. Results are paged using the CDX showResumeKey / resumeKey mechanism, and each page is pushed to the dataset as a batch, so even domains with hundreds of thousands of archived URLs stream out without exhausting memory.
4. For every row, a direct snapshotUrl is constructed in the form https://web.archive.org/web/<timestamp>/<original-url>, so you can open the exact archived page.
5. Slow responses, 5xx errors, and 429 responses are retried with exponential backoff on a fresh proxy IP. The CDX index can be slow, so retries keep large runs reliable.
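
The normalization and query-building steps above can be sketched roughly as follows. This is not the actor's source code. The function names are hypothetical, the mapping of the actor's `subdomains` option to the CDX `domain` match type is an assumption, and so is the use of `limit` as a page size. The query parameters themselves (`url`, `matchType`, `fl`, `collapse`, `filter`, `from`, `to`, `output`, `showResumeKey`, `resumeKey`, `limit`) are ones the public CDX server documents.

```python
from typing import Optional
from urllib.parse import urlencode, urlsplit

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def normalize_domain(value: str) -> str:
    """Strip scheme, www., path and wildcards down to a bare lowercase host."""
    value = value.strip().lower().replace("*", "")
    if "://" not in value:
        value = "http://" + value  # urlsplit only finds the host after a scheme
    host = (urlsplit(value).hostname or "").strip(".")
    return host[4:] if host.startswith("www.") else host


def build_cdx_url(
    host: str,
    match_type: str = "domain",  # assumed CDX equivalent of the "subdomains" option
    from_date: Optional[str] = None,
    to_date: Optional[str] = None,
    status: Optional[str] = None,
    limit: int = 5000,  # assumed page size, not a documented actor setting
    resume_key: Optional[str] = None,
) -> str:
    """Assemble one CDX query URL for a single page of results."""
    params = [
        ("url", host),
        ("matchType", match_type),
        ("fl", "original,timestamp,statuscode,mimetype,digest"),
        ("collapse", "urlkey"),  # one row per unique URL
        ("output", "json"),
        ("showResumeKey", "true"),
        ("limit", str(limit)),
    ]
    if from_date:
        params.append(("from", from_date))
    if to_date:
        params.append(("to", to_date))
    if status:
        params.append(("filter", f"statuscode:{status}"))
    if resume_key:
        params.append(("resumeKey", resume_key))
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


host = normalize_domain("https://www.NASA.gov/missions/")
print(host)  # nasa.gov
print(build_cdx_url(host, from_date="20100101", to_date="20201231", status="200"))
```

Passing the `resume_key` returned by one page into the next call is what lets a client walk a large result set page by page.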

🧰 Tips & best practices

Big domains (news sites, government sites) can have hundreds of thousands of archived URLs. Start with the default maxResults of 5000 to gauge volume, then raise it or set 0 for everything.
Use filterStatus: "200" to skip dead and redirected captures and keep only pages that actually resolved — ideal for building redirect maps.
Narrow with fromDate / toDate (both YYYYMMDD) when you only care about a specific era of the site.
Use matchType: "subdomains" to sweep every subdomain at once, or host for a single host without its subdomains.
Sort or filter the dataset by mimeType to isolate just HTML pages, images, PDFs, etc.
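
Once the dataset is downloaded, the `mimeType` tip can be applied with a couple of small helpers. The sketch below works on rows shaped like the example output; the sample rows and the helper names `count_by_mime` and `only_mime` are made up for illustration.

```python
from collections import Counter


def count_by_mime(rows: list[dict]) -> Counter:
    """Tally rows by their recorded MIME type to gauge what a domain contains."""
    return Counter(row.get("mimeType", "unknown") for row in rows)


def only_mime(rows: list[dict], mime_type: str) -> list[dict]:
    """Keep only rows whose recorded MIME type matches exactly."""
    return [row for row in rows if row.get("mimeType") == mime_type]


# Hypothetical sample rows, not real results.
rows = [
    {"url": "http://example.com/", "mimeType": "text/html"},
    {"url": "http://example.com/logo.png", "mimeType": "image/png"},
    {"url": "http://example.com/about", "mimeType": "text/html"},
    {"url": "http://example.com/report.pdf", "mimeType": "application/pdf"},
]

print(count_by_mime(rows).most_common(1))  # [('text/html', 2)]
print([row["url"] for row in only_mime(rows, "application/pdf")])
```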

❓ FAQ

How do I get all URLs of a website from the Wayback Machine?

Add the domain to Domains, leave matchType on subdomains, set maxResults to 0 for everything, and run it. The actor queries the Internet Archive CDX API and returns one row per unique archived URL.

Can I find old or deleted pages of a domain?

Yes — that's the core use case. The Wayback Machine keeps URLs even after they're removed from the live site, so deleted blog posts, retired product pages, and old endpoints all show up in the results with a snapshotUrl to view them.

How do I export archived URLs to CSV or JSON?

Run the actor, then download the dataset as CSV, JSON or Excel (or pull it via the REST API). Every archived URL is one row, so it drops straight into a spreadsheet or pipeline.
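
For the REST API route, the Apify API exposes dataset items at `/v2/datasets/{datasetId}/items`, with a `format` parameter and an optional `fields` list. The sketch below only builds the download URL. `dataset_items_url` is a hypothetical helper, `YOUR_DATASET_ID` is a placeholder for the dataset ID of your own run, and authentication is left out (a private dataset also needs your API token).

```python
from typing import Optional, Sequence
from urllib.parse import urlencode


def dataset_items_url(
    dataset_id: str,
    fmt: str = "csv",
    fields: Optional[Sequence[str]] = None,
) -> str:
    """Build a download URL for a run's dataset in the requested format."""
    params = {"format": fmt}  # e.g. "json", "csv" or "xlsx"
    if fields:
        # Restrict the export to the listed columns.
        params["fields"] = ",".join(fields)
    return f"https://api.apify.com/v2/datasets/{dataset_id}/items?{urlencode(params)}"


print(dataset_items_url("YOUR_DATASET_ID", "csv", ["url", "capturedAt", "statusCode", "snapshotUrl"]))
```

Limiting `fields` to the columns you need keeps redirect-map spreadsheets small on large domains.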

Is this free and without an API key?

The Internet Archive CDX API is public and requires no API key and no login. The run itself is billed through Apify, and the listing price starts at $3.50 per 1,000 results.

Can I filter by date or status code?

Yes — set fromDate / toDate (YYYYMMDD) to restrict to a capture window, and filterStatus (e.g. 200) to keep only captures with a specific HTTP status.

How many URLs can it return?

The default cap is 5,000 unique URLs per domain. Raise maxResults for more, or set it to 0 for unlimited. Results stream to the dataset in pages via the CDX resumeKey, so even domains with 100k+ URLs run without memory issues.
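
Resume-key paging can be illustrated with a generator that never holds more than one page in memory. This is a sketch under stated assumptions, not the actor's implementation: it assumes each JSON page starts with a header row and, when more data is available, ends with an empty row followed by a one-element row holding the resume key. `split_page`, `iterate_captures`, and the injected `fetch_page` callable are hypothetical names.

```python
from typing import Callable, Iterator, Optional


def split_page(payload: list) -> tuple[list[dict], Optional[str]]:
    """Split one parsed JSON page into records and an optional resume key."""
    if not payload:
        return [], None
    header, body = payload[0], payload[1:]
    resume_key = None
    # Assumed layout: [..., [], ["<resume key>"]] when another page exists.
    if len(body) >= 2 and body[-2] == [] and len(body[-1]) == 1:
        resume_key = body[-1][0]
        body = body[:-2]
    return [dict(zip(header, row)) for row in body], resume_key


def iterate_captures(
    fetch_page: Callable[[Optional[str]], list],
    max_results: int = 0,  # 0 means unlimited, mirroring maxResults
) -> Iterator[dict]:
    """Yield records page by page until the cap is hit or no key is returned."""
    resume_key, emitted = None, 0
    while True:
        records, resume_key = split_page(fetch_page(resume_key))
        for record in records:
            yield record
            emitted += 1
            if max_results and emitted >= max_results:
                return
        if not resume_key:
            return


# Fake two-page source standing in for real HTTP calls.
page_one = [["original", "timestamp"], ["http://a/", "2010"], ["http://b/", "2011"], [], ["KEY"]]
page_two = [["original", "timestamp"], ["http://c/", "2012"]]
fake_fetch = lambda key: page_one if key is None else page_two

print([r["original"] for r in iterate_captures(fake_fetch)])
```

Injecting `fetch_page` keeps the paging logic separate from networking, so retries and proxies can be handled in one place.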

Why are some `statusCode` values `-`?

The Wayback index sometimes records captures without a stored status code (e.g. revisit records). Those rows are still valid archived URLs.
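
Because `statusCode` is a string that may be `-`, downstream code should not convert it to an integer blindly. A minimal sketch, using a hypothetical helper name `parse_status`:

```python
from typing import Optional


def parse_status(value: str) -> Optional[int]:
    """Return the status as an int, or None when the archive stored no code."""
    return int(value) if value.isdigit() else None


print(parse_status("200"))  # 200
print(parse_status("-"))    # None
```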

Sitemap to URL Crawler — extract all URLs from any sitemap.xml.
Website SEO Audit Crawler — run a full on-page SEO audit across a whole site.
Bulk URL Status Checker — check HTTP status codes for a list of URLs in bulk.
Broken Link Checker — crawl a site and find dead links with HTTP status codes.

📝 Changelog

2026-06-15

Initial release — extract archived URLs from the Wayback Machine CDX API with date/status filters, CSV/JSON export, no API key.
